Abstract

We present a simple adaptation of the Lempel-Ziv '78 (LZ78) compression scheme
(IEEE Transactions on Information Theory, 1978) that supports efficient random
access to the input string. Namely, given query access to the compressed string, it is
possible to efficiently recover any symbol of the input string. The compression
algorithm is given as input a parameter $\epsilon > 0$, and with very high probability increases
the length of the compressed string by at most a factor of $(1 + \epsilon)$. The access time is
$O(\log n + 1/\epsilon^2)$ in expectation, and $O(\log n/\epsilon^2)$ with high probability. The scheme
relies on sparse transitive-closure spanners. Any (consecutive) substring of the input
string can be retrieved at an additional additive cost in the running time of the length
of the substring. We also formally establish the necessity of modifying LZ78 so as to
allow efficient random access. Specifically, we construct a family of strings for which
$\Omega(n/\log n)$ queries to the LZ78-compressed string are required in order to recover
a single symbol in the input string. The main benefit of the proposed scheme is that
it preserves the online nature and simplicity of LZ78, and that for every input string,
the length of the compressed string is only a small factor larger than that obtained by
running LZ78.



1 Introduction 



As the sizes of our data sets are skyrocketing, it has become important to allow a user to access
any desired portion of the original data without decompressing the entire dataset. This
problem has been receiving quite a bit of recent attention (see e.g. [14, 2, 7, 12]).
Compression and decompression schemes that allow fast random-access decompression
support have been proposed with the aim of achieving similar compression rates to the
known and widely used compression schemes, such as arithmetic coding [15], LZ78 [16],
LZ77 [13] and Huffman coding [11].
In this work, we focus on adapting the widely used LZ78 compression scheme so as
to allow fast random access support. Namely, given access to the compressed string and a
location $\ell$ in the original uncompressed string, we would like to be able to efficiently
recover the $\ell$-th symbol in the uncompressed string. More generally, the goal is to efficiently
recover a substring starting at location $\ell_1$ and ending at location $\ell_2$ in the uncompressed
string. Previously, Lempel-Ziv-based schemes were designed to support fast random
access, in particular, based on LZ78 [14], LZ77 [12] and as a special case of grammar-based
compression [2].

The first basic question that one may ask is whether there is any need at all to modify the 
LZ78 scheme in order to support fast random access. We formalize the intuition that this 
is indeed necessary and show that without any modifications every (possibly randomized) 
algorithm will need time linear in the length of the LZ78-compressed string to recover a 
single symbol of the uncompressed string. 

Having established that some modification is necessary, the next question is how do we 
evaluate the compression performance of a compression scheme that is a modification of 
LZ78 and supports efficient random access. As different strings have very different com- 
pressibility properties according to LZ78, in order to compare the quality of a new scheme 
to LZ78, we consider a competitive analysis framework. In this framework, we require that 
for every input string, the length of the compressed string is at most a multiplicative factor of
$\alpha$ larger than the length of the LZ78-compressed string, where $\alpha > 1$ is a small constant.
For a randomized compression algorithm this should hold with high probability (that is,
probability $1 - 1/\mathrm{poly}(n)$, where $n$ is the length of the input string). If this bound holds
(for all strings) then we say that the scheme is $\alpha$-competitive with LZ78.

One additional feature of interest is whether the modified compression algorithm pre- 
serves the online nature of LZ78. The LZ78 compression algorithm works by outputting 
a sequence of codewords, where each codeword encodes a (consecutive) substring of the 
input string, referred to as a phrase. LZ78 is online in the sense that if the compression al- 
gorithm is stopped at any point, then we can recover all phrases encoded by the codewords 
output until that point. Our scheme preserves this property of LZ78 and furthermore, sup- 
ports online random access. Namely, at each point in the execution of the compression 
algorithm we can efficiently recover any symbol (substring) of the input string that has al- 
ready been encoded. A motivating example to keep in mind is of a powerful server that 
receives a stream of data over a long period of time. All through this period of time the 
server sends the compressed data to clients which can, in the meantime, retrieve portions 
of the data efficiently. This scenario fits cases where the data is growing incrementally, as 
in log files or user-generated content. 

1.1 Our Results 

We first provide a deterministic compression algorithm which is 3-competitive with LZ78
(as defined above), and a matching random access algorithm which runs in time $O(\log n)$,
where $n$ is the length of the input string. This algorithm retrieves any requested single
symbol of the uncompressed string. By slightly adapting this algorithm it is possible to
retrieve a substring of length $s$ in time $O(\log n) + s$.

Thereafter, we provide a randomized compression algorithm which for any chosen
$\epsilon$ is $(1 + \epsilon)$-competitive with LZ78. The expected running time of the matching random
access algorithm is $O(\log n + 1/\epsilon^2)$, and with high probability is bounded by $O(\log n/\epsilon^2)$.^1
The probability is taken over the random coins of the randomized compression algorithm.
As before, a substring can be recovered in time that is the sum of the (single symbol)
random access time and the length of the string. Similarly to LZ78, the scheme works in an
online manner in the sense described above. The scheme is fairly simple and does not
require any sophisticated data structures. For the sake of simplicity we describe our schemes for the
case in which the alphabet of the input string is $\{0,1\}$, but they can easily be extended to
work for any alphabet $\Sigma$.

As noted previously, we also give a lower bound that is linear in the length of the 
compressed string for any random access algorithm that works with (unmodified) LZ78 
compressed strings. 

Experimental Results. We provide experimental results which demonstrate that our scheme
is competitive and that random access is extremely efficient in practice. An implementation
of our randomized scheme is available online [5].

1.2 Techniques 

The LZ78 compression algorithm outputs a sequence of codewords, each encoding a phrase 
(substring) of the input string. Each phrase is the concatenation of a previous phrase and 
one new symbol. The codewords are constructed sequentially, where each codeword con- 
sists of an index i of a previously encoded phrase (the longest phrase that matches a prefix 
of the yet uncompressed part of the input string), and one new symbol. Thus the code- 
words (phrases they encode) can be seen as forming a directed tree, which is a trie, with 
an edge pointing from each child to its parent. Hence, if a node $v$ corresponds to a phrase
$s_1, \ldots, s_t$, then for each $1 \le j \le t$, there is a node on the path from $v$ to the root (an
ancestor of $v$, or $v$ itself when $j = t$) that corresponds to the prefix $s_1, \ldots, s_j$ and is
encoded by the codeword $(i, s_j)$ (for some $i$), so that $s_j$ can be "revealed" by obtaining
this codeword.

In order to support random access, we want to be able to perform two tasks. The first
task is to identify, for any given index $\ell$, the codeword that encodes the phrase to
which the $\ell$-th symbol of the input string belongs. We refer to this codeword as the "target
codeword". Let $p$ denote the starting position of the corresponding phrase (in the input string);
the second task is then to navigate (quickly) up the tree (from the node corresponding
to the target codeword) and reach the ancestor node/codeword at depth $\ell - p + 1$ in the
tree. This codeword reveals the symbol we are looking for. In order to be able to perform
these two tasks efficiently, we modify the LZ78 codewords. To support the first task we
add information concerning the position of phrases in the input (uncompressed) string. To
support the second task we add additional pointers to ancestor nodes in the tree, that is,
indices of encoded phrases that correspond to such nodes. Thus we (virtually) construct a

(very sparse) Transitive Closure (TC) spanner [1] on the tree. The spanner allows quick
navigation between pairs of codewords.

^1 This bound can be improved to O((log n/ε + log(log n/ε)), but this improvement comes at a
cost of making the algorithm somewhat more complicated, and hence we have chosen only to sketch this
improvement (see Subsection B.2).

When preprocessing is allowed, both tasks can be achieved more efficiently using auxiliary
data structures. Specifically, the first task can be achieved using rank and select queries
in time complexity $O(1)$ (see e.g. [10]) and the second task can be achieved in time
complexity $O(\log\log n)$ via level-ancestor queries on the trie (see e.g. [6]). However, these
solutions are not adaptable, at least not in a straightforward way, to the online setting and
furthermore the resulting scheme is not $(1 + \epsilon)$-competitive with LZ78 for every $\epsilon$.

In the deterministic scheme, which is 3-competitive with LZ78, we include the
additional information (of the position and one additional pointer) in every codeword, thus
making it relatively easy to perform both tasks in time $O(\log n)$. In order to obtain the
scheme that is $(1 + \epsilon)$-competitive with LZ78 we include the additional information only in
an $O(\epsilon)$-fraction of the codewords, and the performance of the tasks becomes more
challenging. Nonetheless, the dependence of the running time on $n$ remains logarithmic (and
the dependence on $1/\epsilon$ is polynomial).

The codewords which include additional information are chosen randomly in order to 
spread them out evenly in the trie. It is fairly easy to obtain similar results if the structure of 
the trie is known in advance, however, in an online setting, the straightforward deterministic 
approach can blow up the size of the output by a large factor. 

1.3 Related Work 

Sadakane and Grossi [14] give a compression scheme that supports the retrieval of any
$s$-long consecutive substring of an input string $S$ of length $n$ over alphabet $\Sigma$ in
$O(1 + s/\log_{|\Sigma|} n)$ time. In particular, for a single symbol in the input string the running time
is $O(1)$. The number of bits in the compressed string is upper bounded by
$nH_k(S) + O\left(\frac{n}{\log_{|\Sigma|} n}\,(k \log|\Sigma| + \log\log n)\right)$,
where $H_k(S)$ is the $k$-th order empirical entropy of $S$.

Since their compression algorithm builds on LZ78, the bound on the length of the compressed
string for any given input string can actually be expressed as the sum of the length
of the LZ78-compressed string plus $\Theta(n \log\log n/\log n)$ bits for supporting rank and select
operations in constant time.^2 They build on the LZ78 scheme in the sense that they
store suites of data structures that encode the structure of the LZ78 trie and support fast random
access. Hence, for input strings that are compressed by LZ78 to a number of bits that
is at least on the order of $n \log\log n/\log n$, their result is essentially the best possible as
compared to LZ78. However, their scheme is not in general competitive (as defined above)
with LZ78 because of its performance on highly compressible strings. We also note that
their compression algorithm does not work in an online fashion, but rather constructs all
the supporting data structures given the complete LZ78 trie.

Two alternative schemes which give the same space and time bounds as in [14] were
provided by Gonzalez and Navarro [9] and Ferragina and Venturini [7], respectively. They
are simpler: the first uses an arithmetic encoder and the second does not use any compressor.
(They also differ in terms of whether $k$ has to be fixed in advance.) By the above
discussion, the performance of these schemes is not in general competitive with LZ78.

^2 The $\Theta(n \log\log n/\log n)$ space requirement can be decreased if one is willing to spend more than
constant time.

Kreft and Navarro [12] provide a variant of LZ77 that supports retrieval of any $s$-long
consecutive substring of $S$ in $O(s)$ time. They show that in practice their scheme achieves
results close to LZ77 (in terms of the compression ratio). However, the usage of a data
structure that supports the rank and select operations requires $\Omega(n \log\log\log n)$ bits.

The Lempel-Ziv compression family belongs to a wider family of schemes called grammar-based
compression schemes. In these schemes the input string is represented by a context-free
grammar (CFG), which is unambiguous, namely, it generates a unique string. Bille
et al. [2] show how to transform any grammar-based compression scheme so as to support
random access in $O(\log n)$ time. The transformation increases the compressed representation
by a multiplicative factor (larger than 1).

2 Preliminaries 

The LZ78 compression scheme. Before we describe our adaptation of the LZ78 scheme [16],
we describe the latter in detail. The LZ78 compression algorithm receives an input string
$x \in \Sigma^n$ over alphabet $\Sigma$ and returns a list, $C^x = C^x_{LZ}$, of codewords of the form $(i, b)$,
where $i \in \mathbb{N}$ and $b \in \Sigma$. Henceforth, unless specified otherwise, $\Sigma = \{0, 1\}$. Each
codeword $(i, b)$ encodes a phrase, namely a substring of $x$, which is the concatenation of the $i$-th
phrase (encoded by $C^x[i]$) and $b$, where we define the 0-th phrase to be the empty string.
The first codeword is always of the form $(0, x_1)$, indicating that the first phrase consists of
a single symbol $x_1$. The compression algorithm continues scanning the input string $x$ and
partitioning it into phrases. When determining the $j$-th phrase, if the algorithm has already
scanned $x[1, \ldots, k]$, then the algorithm finds the longest prefix of $x[k+1, \ldots, n-1]$ that is
the same as a phrase with index $i < j$. If this prefix is $x[k+1, \ldots, t]$, then the algorithm
outputs the codeword $(i, x_{t+1})$ (if the prefix is empty, then $i = 0$).
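
The parsing rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `lz78_compress` and the dictionary-based trie are assumptions made here, symbols are kept as characters, and the final phrase is always closed at the last input symbol (mirroring the restriction of the match to $x[k+1, \ldots, n-1]$), so the last phrase may repeat an earlier one.

```python
def lz78_compress(x):
    """Return the LZ78 codeword list [(i, b), ...] for the string x.

    Phrase 0 is the empty phrase; the j-th codeword (1-based) encodes
    phrase i followed by the symbol b.
    """
    children = {}      # (parent phrase index, symbol) -> phrase index
    codewords = []
    node = 0           # current trie node; 0 is the root (empty phrase)
    for pos, sym in enumerate(x):
        nxt = children.get((node, sym))
        if nxt is not None and pos < len(x) - 1:
            node = nxt                      # keep extending the match
        else:
            codewords.append((node, sym))   # new phrase = matched phrase + sym
            children[(node, sym)] = len(codewords)
            node = 0                        # restart matching at the root
    return codewords
```

For instance, the input 0010110 is parsed into the phrases 0, 01, 011, 0, which correspond to the codewords (0, 0), (1, 1), (2, 1), (0, 0).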

An efficient (linear in $n$) LZ78 compression algorithm can be implemented by maintaining
an auxiliary trie (as illustrated in Figure 2). The trie structure is implicit
in the output of the LZ78 algorithm. Namely, for an input string $x \in \{0, 1\}^n$, the trie
$T^x = (V^x, E^x)$ is defined as follows. For each codeword $C^x[i]$, $1 \le i \le m$, there is a node
$v_i$ in $V^x$, and there is also a node $v_0$ corresponding to the root of the tree. If $C^x[j] = (i, b)$,
then there is an edge between $v_j$ and $v_i$ (so that $v_i$ is the parent of $v_j$). Given the
correspondence between codewords and nodes in the trie, we shall sometimes refer to them
interchangeably.

In the course of the compression process, when constructing the $j$-th codeword (after
scanning $x[1, \ldots, k]$) the compression algorithm can find the longest prefix of
$x[k+1, \ldots, n-1]$ that matches an existing phrase $i$ simply by walking down the trie. Once the
longest match is found (the deepest node is reached), a new node is added to the trie. Thus
the trie structure may be an actual data structure used in the compression process, but it
is also implicit in the compressed string (where we think of a codeword $C^x[j] = (i, b)$ as
having a pointer to its parent $C^x[i]$). Decompression can also be implemented in linear
time by iteratively recovering the phrases that correspond to the codewords and essentially
rebuilding the trie (either explicitly or implicitly). In what follows, we refer to $i$ as the index
of $C^x[i]$ and to $x[j]$ as the bit at position $j$.
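
As an illustration of the implicit trie, the sketch below decompresses a codeword list and also recovers a single symbol from unmodified LZ78 codewords by following parent pointers. The names `lz78_decompress`, `phrase_lengths` and `naive_access` are hypothetical. Note that `naive_access` first scans the codewords to locate the phrase containing the requested position, so it reads a number of codewords linear in $m$; this is exactly the cost that the modified schemes are designed to avoid.

```python
def lz78_decompress(codewords):
    """Rebuild the input string from LZ78 codewords (i, b)."""
    phrases = [""]                       # phrase 0 is the empty string
    for i, b in codewords:
        phrases.append(phrases[i] + b)   # phrase j = phrase i followed by b
    return "".join(phrases[1:])


def phrase_lengths(codewords):
    """Length of every phrase, which equals the depth of its trie node."""
    depth = [0]
    for i, _b in codewords:
        depth.append(depth[i] + 1)
    return depth


def naive_access(codewords, pos):
    """Return x[pos] (1-based) using only the plain (i, b) codewords."""
    depth = phrase_lengths(codewords)    # linear scan over all codewords
    start = 1                            # starting position of phrase j
    for j in range(1, len(codewords) + 1):
        if pos < start + depth[j]:
            node = j
            # Climb to the ancestor whose phrase ends exactly at pos.
            for _ in range(start + depth[j] - 1 - pos):
                node = codewords[node - 1][0]
            return codewords[node - 1][1]
        start += depth[j]
    raise IndexError("position outside the encoded string")
```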
Competitive schemes with random access support. We aim to provide a scheme, $A$,
which compresses every input string almost as well as LZ78 and supports efficient local
decompression. Namely, given access to a string that is the output of $A$ on input $x$ and
$1 \le \ell_1 \le \ell_2 \le n$, the local decompression algorithm outputs $x[\ell_1, \ldots, \ell_2]$ efficiently. In
particular, it does so without decompressing the entire string. We first describe our scheme
for the case where $\ell_1 = \ell_2$, which we refer to as random access, and later explain how to
extend the scheme for $\ell_1 < \ell_2$. The quality of the compression is measured with respect to
LZ78; formally, we require the scheme to be competitive with LZ78 as defined next. We
note that here and in all that follows, when we say "with high probability" we mean with
probability at least $1 - 1/\mathrm{poly}(n)$.

Definition 1 (Competitive schemes) Given a pair of deterministic compression
algorithms $A : \{0, 1\}^* \to \{0, 1\}^*$ and $B : \{0, 1\}^* \to \{0, 1\}^*$, we say that algorithm $B$ is
$\alpha$-competitive with $A$ if for every input string $x \in \{0,1\}^*$, we have $|C^x_B| \le \alpha |C^x_A|$, where
$C^x_A$ and $C^x_B$ are the compressed strings output by $A$ and $B$, respectively, on input $x$. For a
randomized algorithm $B$, the requirement is that $|C^x_B| \le \alpha |C^x_A|$ with high probability over
the random coins of $B$.

Word RAM model. We consider the RAM model with word size $\log n + 1$, where $n$ is
the length of the input string. We note that it suffices to have an upper bound on this value
in order to have a bound on the number of bits for representing any index of a phrase. A
codeword of LZ78 is one word, i.e., $i$ and $b$ appear consecutively where $i$ is represented by
$\log n$ bits. For the sake of clarity of the presentation, we write it as $(i, b)$. Our algorithms
(which support random access) use words of size $\log n + 1$ as well. If one wants to
consider variants of LZ78 that apply bit optimization and/or work when an upper bound
on the length of the input string is not known in advance, then our algorithms need to be
modified accordingly so as to remain competitive (with the same competitive ratio).

We wish to point out that if we take the word size to be $\log m + 1$ (instead of $\log n + 1$),
where $m$ is the number of phrases in the compressed string, then our results remain
effectively the same. Specifically, in the worst case the blow-up in the deterministic scheme
is a factor of 4 (instead of 3) and in the randomized scheme a factor of $(1 + 2\epsilon)$ (instead
of $(1 + \epsilon)$).

3 A Deterministic Scheme 

In this section we describe a simple deterministic compression scheme which is based on 
the LZ78 scheme. 

In the deterministic compression scheme, to each codeword we add a pair of additional
entries. The first additional entry is the starting position of the encoded phrase in the
uncompressed string. On an input $x \in \{0, 1\}^n$ and $1 \le \ell \le n$, this allows the algorithm to
efficiently find the codeword encoding the phrase that contains the $\ell$-th bit by performing
a binary search on the position entries. The second entry we add is an extra pointer (we
shall use the terms "pointer" and "index" interchangeably). Namely, while in LZ78 each
codeword indicates the index of the former codeword, i.e., the direct parent in the trie (see
Section 2), we add another index, to an ancestor node/codeword (which is not the direct
parent). In order to allow efficient random access, our goal is to guarantee that for every
pair of connected nodes $u, v$ there is a short path connecting $u$ and $v$. Namely, if we let
$d_G(u, v)$ denote the length of the shortest path from $u$ to $v$ in a directed graph $G$, then
the requirement is that for $u, v$ such that $d_G(u, v) < \infty$ it holds that $d_G(u, v)$ is small.
Before we describe how to achieve this property on (a super-graph of) the constructed trie
we describe how to guarantee the property on a simple directed path. Formally we are
interested in constructing a Transitive-Closure (TC) spanner, defined as follows:

Definition 2 (TC-spanner [1]) Given a directed graph $G = (V, E)$ and an integer $k \ge 1$,
a $k$-transitive-closure-spanner ($k$-TC-spanner) of $G$ is a directed graph $H = (V, E_H)$ with
the following properties:

1. $E_H$ is a subset of the edges in the transitive closure^3 of $G$.

2. For all vertices $u, v \in V$, if $d_G(u, v) < \infty$, then $d_H(u, v) \le k$.

3.1 TC Spanners for Paths and Trees 

Let $C_n = (V, E)$ denote the directed line (path) over $n$ nodes (where edges are directed
"backward"). Namely, $V = \{0, \ldots, n-1\}$ and $E = \{(i, i-1) : 1 \le i \le n-1\}$. Let
$f_n(i) = i \bmod \lfloor \log n \rfloor$ and let $E' = \{(i, \max\{i - 2^{f_n(i)} \cdot \lfloor \log n \rfloor, 0\}) : 1 \le i \le n-1\}$.
Observe that each node $1 \le i \le n-1$ has exactly one outgoing edge in $E'$ (in addition to
the single outgoing edge in $E$). Define $H_n = (V, E \cup E')$.

Claim 1 $H_n$ is a $(4 \log n)$-TC-spanner of $C_n$.

Proof: For every $0 \le r < t \le n-1$, consider the following algorithm to get from $t$ to $r$
(at each step of the algorithm stop if $r$ is reached):

1. Starting from $t$ and using the edges of $E$, go to the first node $u$ such that
$f_n(u) = \lfloor \log n \rfloor - 1$.

2. From $u$ iteratively proceed by taking the outgoing edge in $E'$ if it does not go beyond
$r$ (i.e., if the node reached after taking the edge is not smaller than $r$), and taking the
outgoing edge in $E$ otherwise.

Clearly, when the algorithm terminates, $r$ is reached. Therefore, it remains to show that the
length of the path taken by the algorithm is bounded by $4 \log n$. Let $a(i)$ denote the node
reached by the algorithm after taking $i$ edges in $E$ starting from $u$. Therefore, $a(0) = u$
and $f_n(a(i)) = \lfloor \log n \rfloor - 1 - i$ for every $0 \le i < \lfloor \log n \rfloor$ and $i \le s$, where $s$ denotes the
total number of edges taken in $E$ starting from $u$. For every pair of nodes $w \ge q$ define
$g(w, q) = \lfloor (w - q)/\lfloor \log n \rfloor \rfloor$, i.e., the number of complete blocks between $w$ and $q$. Thus,
$g(a(i), r)$ is monotonically non-increasing in $i$, for $i \le s$. Consider the bit representation of
$g(a(i), r)$. If from node $a(i)$ the algorithm does not take the edge in $E'$ it is implied that

the $j$-th bit in $g(a(i), r)$ is 0 for every $j \ge f_n(a(i))$. On the other hand, if from node
$a(i)$ the algorithm takes the edge in $E'$ then after taking this edge the $f_n(a(i))$-th bit becomes
0. Therefore by an inductive argument, when the algorithm reaches $a(i)$, the $j$-th bit of $g(a(i), r)$ is
0 for every $j > f_n(a(i))$. Thus, $g(a(\min\{\lfloor \log n \rfloor - 1, s\}), r) = 0$, implying that the total
number of edges taken on $E'$ is at most $\log n$. Combined with the fact that the total number
of edges taken on $E$ in Step 2 is bounded by $2 \log n$ and the fact that the total number of
edges taken on $E$ in Step 1 is bounded by $\log n$, the claim follows. ■

^3 The transitive closure of a graph $G = (V, E)$ is the graph $H = (V', E')$ where $V' = V$ and
$E' = \{(u, v) : d_G(u, v) < \infty\}$.
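
The two-step procedure used in the proof can be simulated directly. The sketch below is only an illustration: the function name `walk` is hypothetical, and $\lfloor \log n \rfloor$ is computed with base-2 logarithms, which is an assumption consistent with the word size used in this paper. The function follows Step 1 and Step 2 on $H_n$ and returns the number of edges traversed from $t$ to $r$.

```python
def walk(t, r, n):
    """Number of edges used to get from t down to r in H_n (0 <= r <= t < n)."""
    L = n.bit_length() - 1            # floor(log2 n); requires n >= 2
    node, steps = t, 0
    # Step 1: follow edges of E (i -> i - 1) until f_n(node) = L - 1.
    while node != r and node % L != L - 1:
        node -= 1
        steps += 1
    # Step 2: take the edge of E' whenever it does not go beyond r,
    # and the edge of E otherwise.
    while node != r:
        target = max(node - (2 ** (node % L)) * L, 0)
        node = target if target >= r else node - 1
        steps += 1
    return steps
```

For example, with $n = 16$ the walk from $t = 15$ to $r = 1$ visits 15, 14, 13, 5, 4, 3, 2, 1, that is, 7 edges, of which the step from 13 to 5 uses an edge of $E'$.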

From Claim 1 it follows that for every $m < n$, $V = \{0, \ldots, m\}$, $E = \{(i, i-1) :
1 \le i \le m-1\}$ and $E' = \{(i, \max\{i - 2^{f_n(i)} \cdot \lfloor \log n \rfloor, 0\}) : 1 \le i \le m-1\}$, $(V, E \cup E')$ is a
$(4 \log n)$-TC-spanner of $C_m$. This implies a construction of a $(4 \log n)$-TC-spanner for any tree on
$n$ nodes. Specifically, we consider trees where the direction of the edges is from child to
parent (as defined implicitly by the codewords of LZ78) and let $d(v)$ denote the depth of
a node $v$ in the tree (where the depth of the root is 0). If in addition to the pointer to the
parent, each node, $v$, points to the ancestor at distance $2^{f_n(d(v))} \cdot \lfloor \log n \rfloor$ (if such a node
exists), then for every pair of nodes $u, v$ on a path from a leaf to the root, there is a path of
length at most $4 \log n$ connecting $u$ and $v$.

We note that using $k$-TC-spanners with $k = o(\log n)$ will not improve the running time
of our random access algorithms asymptotically (since they perform an initial stage of a
binary search).

3.2 Compression and Random Access Algorithms 

As stated at the start of this section, in order to support efficient random access we modify
the codewords of LZ78. Recall that in LZ78 the codewords have the form $(i, b)$, where
$i$ is the index of the parent codeword (node in the trie) and $b$ is the additional bit. In the
modified scheme, codewords are of the form $W = (p, i, k, b)$, where $i$ and $b$ remain the
same, $p$ is the starting position of the encoded phrase in the uncompressed string and $k$
is an index of an ancestor codeword (i.e., encoding a phrase that is a prefix of the phrase
encoded by $W$). As in LZ78, our compression algorithm (whose pseudo-code appears in
Algorithm 1, Subsection A.1) maintains a trie $T$ as a data structure where the nodes of the
trie correspond to codewords encoding phrases (see Section 2). Initially, $T$ consists of a
single root node. Thereafter, the input string is scanned and a node is added to the trie for
each codeword that the algorithm outputs, giving the ability to efficiently construct the next
codewords. The data structure used is standard: for each node the algorithm maintains the
index of the phrase that corresponds to it, its depth, and pointers to its children.

Given access to a compressed string, which is a list of codewords $C[1, \ldots, m]$, and an
index $1 \le \ell \le n$, the random access algorithm (whose pseudo-code appears in Algorithm 2,
Subsection A.1) first performs a binary search (using the position entries in the codewords)
in order to find the codeword, $C[t]$, which encodes the phrase $x[\ell_1, \ldots, \ell_2]$ containing the
$\ell$-th bit of the input string $x$ (i.e., $\ell_1 \le \ell \le \ell_2$). The algorithm then reads $O(\log n)$ codewords
from the compressed string, using the parent and ancestor pointers in the codewords, in
order to go up the trie (implicitly defined by the codewords) to the node at distance $\ell_2 - \ell$
from the node corresponding to $C[t]$. The final node reached corresponds to the codeword,
$C[r] = (p_r, i_r, k_r, b_r)$, which encodes the phrase $x[p_r, \ldots, p_r + \ell - \ell_1] = x[\ell_1, \ldots, \ell]$ and so
the algorithm returns $b_r$.

The next theorem follows directly from the description of the algorithms and Claim 1.

Theorem 1 Algorithm 1 (compression algorithm) is 3-competitive with LZ78, and for
every input $x \in \{0, 1\}^n$, the running time of Algorithm 2 (random access algorithm) is
$O(\log n)$.
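
The following Python sketch illustrates the deterministic scheme under several assumptions that are made here and are not taken from the pseudo-code in the appendix: the input length $n$ is known to the compressor, codewords are Python tuples `(p, i, k, b)`, the value `k = 0` stands for "no extra pointer", logarithms are base 2, and the function names `compress` and `access` are hypothetical. The compressor keeps the root-to-node path of the current match so that the ancestor at distance $2^{f_n(d(v))} \cdot \lfloor \log n \rfloor$ can be looked up directly.

```python
def compress(x):
    """Codewords (p, i, k, b) for the binary string x; positions are 1-based."""
    L = max(len(x).bit_length() - 1, 1)      # floor(log2 n), at least 1
    children, out = {}, []
    path, start = [0], 1                     # path[d] = depth-d node on the current match
    for pos, b in enumerate(x, start=1):
        nxt = children.get((path[-1], b))
        if nxt is not None and pos < len(x):
            path.append(nxt)                 # extend the match
            continue
        d = len(path)                        # depth of the new node
        far = d - (2 ** (d % L)) * L         # depth targeted by the extra pointer
        k = path[far] if far >= 1 else 0     # 0 means "no such ancestor"
        out.append((start, path[-1], k, b))
        children[(path[-1], b)] = len(out)
        path, start = [0], pos + 1
    return out


def access(code, n, pos):
    """Return x[pos] (1-based) by a binary search followed by a spanner walk."""
    L = max(n.bit_length() - 1, 1)
    lo, hi = 0, len(code) - 1
    while lo < hi:                           # last codeword with p <= pos
        mid = (lo + hi + 1) // 2
        if code[mid][0] <= pos:
            lo = mid
        else:
            hi = mid - 1
    end = code[lo + 1][0] - 1 if lo + 1 < len(code) else n
    node = lo + 1                            # 1-based index of the target codeword
    depth = end - code[lo][0] + 1            # phrase length = depth in the trie
    want = pos - code[lo][0] + 1             # depth of the node that reveals x[pos]
    while depth != want and depth % L != L - 1:      # Step 1: parent pointers
        node, depth = code[node - 1][1], depth - 1
    while depth != want:                             # Step 2: jump when possible
        far = depth - (2 ** (depth % L)) * L
        if far >= want:
            node, depth = code[node - 1][2], far
        else:
            node, depth = code[node - 1][1], depth - 1
    return code[node - 1][3]
```

In this representation every codeword carries three index-sized entries (the position, the parent together with the bit, and the extra pointer) where LZ78 carries one, which is the source of the factor 3 in Theorem 1.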

Recovering a substring. We next describe how to recover a consecutive substring
$x[\ell_1, \ldots, \ell_2]$, given the compressed string $C[1, \ldots, m]$. The idea is to recover the
substring in reverse order as follows. Find the codeword, $C[k]$, encoding the substring (phrase)
$x[t_1, \ldots, t_2]$ such that $t_1 \le \ell_2 \le t_2$, as in Step 1 of Algorithm 2. Then, as in Step 2 of
Algorithm 2, find the codeword, $C[t]$, which encodes $x[t_1, \ldots, \ell_2]$. From $C[t]$ recover the
rest of the substring ($x[t_1, \ldots, \ell_2 - 1]$) by going up the trie. If the root is reached before
recovering $\ell_2 - \ell_1 + 1$ bits (i.e., $\ell_1 < t_1$), then continue decoding $C[k-1], C[k-2], \ldots$
until reaching the encoding of the phrase within which $x[\ell_1]$ resides. The running time is
the sum of the running time of a single random access execution, plus the length of the
substring.
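
The reverse-order recovery can be sketched as follows. The codeword layout `(p, i, k, b)` with 1-based positions and indices and the function name `substring` are assumptions of this illustration. To keep the example short, the initial climb to $C[t]$ uses plain parent pointers; in the scheme described above that climb would use the extra pointers so as to take $O(\log n)$ steps.

```python
def substring(code, n, l1, l2):
    """Return x[l1..l2] (1-based, inclusive), recovered from right to left."""
    lo, hi = 0, len(code) - 1
    while lo < hi:                       # last codeword whose phrase starts at or before l2
        mid = (lo + hi + 1) // 2
        if code[mid][0] <= l2:
            lo = mid
        else:
            hi = mid - 1
    k = lo
    end = code[k + 1][0] - 1 if k + 1 < len(code) else n
    node = k + 1
    for _ in range(end - l2):            # climb to the codeword encoding x[t1..l2]
        node = code[node - 1][1]
    bits, need = [], l2 - l1 + 1
    while len(bits) < need:
        while node != 0 and len(bits) < need:
            bits.append(code[node - 1][3])   # last bit of the current prefix
            node = code[node - 1][1]         # move to the parent
        k -= 1
        node = k + 1                     # continue with the preceding phrase
    return "".join(reversed(bits))
```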

4 A Randomized Scheme 

In this section we present a randomized compression scheme which builds on the deterministic
scheme described in Section 3. In what follows we describe the randomized compression
algorithm and the random access algorithm. Their detailed pseudo-codes are given in
Algorithm 3 (see Subsection A.2) and Algorithm 4 (see Subsection A.1), respectively.
Recovering a substring is done in the same manner as described for the deterministic scheme.

We assume that $\epsilon = \Omega(\log n/\sqrt{n})$ (or else one might as well compress using LZ78
without any modifications).

The high-level idea of the compression scheme. Recall that the deterministic compression
algorithm (Algorithm 1), which was 3-competitive, adds to each LZ78 codeword two
additional information entries: the starting position of the corresponding phrase, and an
additional index (pointer) for navigating up the trie. The high-level idea of the randomized
compression algorithm, which is $(1 + \epsilon)$-competitive, is to "spread" this information more
sparsely. That is, rather than maintaining the starting position of every phrase, it maintains
the position only for a $\Theta(\epsilon)$-fraction of the phrases, and similarly only an $O(\epsilon)$-fraction of
the nodes in the trie have additional pointers for "long jumps". While spreading out the
position information is done deterministically (by simply adding this information once in
every $O(1/\epsilon)$ codewords), the additional pointers are added randomly (and independently).
Since the trie structure is not known in advance, this ensures (with high probability) that
the number of additional pointer entries is $O(\epsilon)$ times the number of nodes (phrases), as
well as ensuring that the additional pointers are fairly evenly distributed in each path in the
trie. We leave it as an open question whether there exists a deterministic (online) algorithm
that always achieves such a guarantee.^4

Because of the sparsity of the position and extra-pointer entries, finding the exact phrase
to which an input bit belongs and navigating up the trie in order to determine this bit is
not as self-evident as it was in the deterministic scheme. In particular, since the position
information is added only once every $O(1/\epsilon)$ phrases, a binary search (similar to the
one performed by the deterministic algorithm) for a location $\ell$ in the input string does not
uniquely determine the phrase to which the $\ell$-th bit belongs. In order to facilitate finding
this phrase (among the $O(1/\epsilon)$ potential candidates), the compression algorithm adds one
more type of entry to an $O(\epsilon)$-fraction of the nodes in the trie: their depth (which equals the
length of the phrase to which they correspond). This information also aids the navigation
up the trie, as will be explained subsequently.

A more detailed description of the compression algorithm. Similarly to the determin- 
istic compression algorithm, the randomized compression algorithm scans the input string 
and outputs codewords containing information regarding the corresponding phrases (where 
the phrases are the same as defined by LZ78). However, rather than having just one type of 
codeword, it has three types: 

• A simple codeword of the form $(i, b)$, which is similar to the codeword LZ78 outputs.
Namely, $i$ is a pointer to a former codeword (which encodes the previously encountered
phrase that is the longest prefix of the current one), and $b$ is a bit. Here, since
the length of the codewords is not fixed, the pointer $i$ indicates the starting position
of the former codeword in the compressed string rather than its index. We refer to $i$
as the parent entry, and to $b$ as the value entry.

• A special codeword, which encodes additional information regarding the corresponding
node in the trie. Specifically, in addition to the entries $i$ and $b$ as in a simple codeword,
there are three additional entries. One is the depth of the corresponding node,
$v$, in the tree, and the other two are pointers (starting positions in the compressed
string) to special codewords that correspond to ancestors of $v$. We refer to one of
these entries as the special_parent and the other as the special_ancestor. Details of
how they are selected are given subsequently.

• A position codeword, which contains the starting position of the next encoded phrase
in the uncompressed string.

In what follows we use the term word (as opposed to codeword) to refer to the RAM words
of which the codewords are built. Since codewords have different types and lengths (in
terms of the number of words they consist of), the compression algorithm adds a special
delimiter word before each special codeword and (a different one) before each position
codeword.^5

^4 The simple idea of adding an extra pointer to all the nodes whose depth is divisible by $k = \Theta(1/\epsilon)$,
excluding nodes with height smaller than $k$, will indeed ensure the even distribution on each path. However,
since we do not know the height of each node in advance, if we remove this exclusion we might cause the
number of additional pointers to be too large, e.g., if the trie is a complete binary tree with height divisible by
$k$, then every leaf gets an additional pointer.

The algorithm includes a position codeword every $c/\epsilon$ words (where $c$ is a fixed 
constant). More precisely, since such a word might be in the middle of a codeword, the 
position codeword is actually added right before the start of the next codeword (that is, 
at most a constant number of words away). As stated above, the position is the starting 
position of the phrase encoded by the next codeword. 

Turning to the special codewords, each codeword that encodes a phrase is selected to be 
a special codeword independently at random with probability $\epsilon/c$. We refer to the 
nodes in the trie that correspond to special codewords as special nodes. Let $u$ be a 
special node (where this information is maintained using a Boolean-valued field named 
'special'). In addition to a pointer $i$ to its parent node in the trie, it is given a 
pointer $q$ to its closest ancestor that is a special node (its special parent) and a 
pointer $a$ to a special ancestor. The latter is determined based on the special depth of 
$u$, that is, the number of special ancestors of $u$ plus 1, similarly to the way it is 
determined by the deterministic algorithm. Thus, the special nodes are connected among 
themselves by a TC-spanner (with out-degree 2). 
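
The word accounting implied by these layout rules can be sketched as follows. This is only an illustration that tracks word counts, not codeword contents: the function name `layout_word_counts`, the seed, and the default $c = 40$ (borrowed from the pseudo-code in Subsection A.2) are assumptions. It charges 1 word per simple codeword, 5 words per special codeword and 2 words per position codeword, which are the weights used in the proof of Theorem 2.

```python
import random

def layout_word_counts(num_phrases, eps, c=40, seed=0):
    """Simulate how many words each codeword type contributes.

    Returns (simple, special, position, total_words).
    """
    rng = random.Random(seed)
    period = max(1, round(c / eps))    # target gap between position codewords
    next_mark = period
    simple = special = position = words = 0
    for _ in range(num_phrases):
        if rng.random() < eps / c:     # phrase selected to be special
            special += 1
            words += 5                 # delimiter + depth + two pointers + (i, b)
        else:
            simple += 1
            words += 1                 # a simple codeword occupies one word
        if words >= next_mark:         # mark crossed: emit before the next codeword
            position += 1
            words += 2                 # delimiter + position
            next_mark += period
    return simple, special, position, words

simple, special, position, total = layout_word_counts(100_000, eps=0.5)
print(total / 100_000)                 # overhead factor relative to plain LZ78
```

The position codeword is emitted right after the codeword that crosses the mark, which mirrors the rule that it is placed at most a constant number of words away from the fixed position.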

A more detailed description of the random access algorithm. The random access algorithm, 
Algorithm 4, is given access to a string $S$, which was created by the randomized 
compression algorithm, Algorithm 3. This string consists of codewords $C[1], \ldots, C[m]$ 
(of varying lengths, so that each $C[j]$ equals $S[r, \ldots, r+h]$ for $h \in \{0, 1, 4\}$). 
Similarly to Algorithm 2 for random access when the string is compressed using the 
deterministic compression algorithm, Algorithm 1, the algorithm for random access when the 
string is compressed using the randomized compression algorithm consists of two stages. 
Given an index $1 \le \ell \le n$, in the first stage the algorithm finds the codeword that 
encodes the phrase $x[\ell_1, \ldots, \ell_2]$ to which the $\ell$-th bit of the input 
string $x$ belongs (so that $\ell_1 \le \ell \le \ell_2$). In the second stage it finds the 
codeword that encodes the phrase $x[\ell_1, \ldots, \ell]$ (which appeared earlier in the 
string), and returns its value entry (i.e., the bit $b$). 

Recall that on input $\ell$ and $C[1, \ldots, m]$, Algorithm 2 (in Step 1) first finds the 
codeword that encodes the phrase to which the $\ell$-th bit of the input string belongs by 
performing a binary search. This is done using the position entries, where each codeword 
has such an entry. However, in the output string of the randomized compression scheme it 
is no longer the case that each codeword has a position entry. Still, the random access 
algorithm can perform a binary search over the position codewords. Recall that the 
randomized compression algorithm places these codewords at almost fixed positions in the 
compressed string (namely, at positions that are at most a constant number of words away 
from the fixed positions), and these codewords are marked by a delimiter. Hence, the 
algorithm can find two position codewords, $C[k]$ and $C[q]$, such that 
$q - k = O(1/\epsilon)$ and such that $\ell$ is between the positions corresponding to 
these codewords. This implies that the requested bit $x[\ell]$ belongs to one of the 
phrases associated with the codewords $C[k+1], \ldots, C[q-1]$. 

In order to find the desired codeword $C[t]$ where $k < t < q$, the algorithm calculates 
the length of the phrase each of the codewords $C[k+1], \ldots, C[q-1]$ encodes. This 
length equals the depth of the codeword (that is, of its corresponding node) in the trie. 
If a codeword is a special codeword, then this information is contained in the codeword. 
Otherwise (the codeword is a simple codeword), the algorithm computes the depth of the 
corresponding node by going up the trie until it reaches a special node (corresponding to 
a special codeword). Recall that a walk up the trie can be performed using the basic 
parent pointers (contained in both simple and special codewords), and that each special 
codeword is marked by a delimiter, so that it can be easily recognized as special. (For 
the pseudo-code see Procedure Find-Depth in Subsection A.2.) 

^In particular, the two delimiter words can be the all-1 word and the word that is all-1 
with the exception of the last bit, which is 0. This is possible because the number of 
words in the compressed string is $O(n/\log n)$. 

Let the phrase encoded by $C[t]$ be $x[\ell_1, \ldots, \ell_2]$ (where 
$\ell_1 \le \ell \le \ell_2$). In the second stage, the random access algorithm finds the 
codeword, $C[r]$, which encodes the phrase $x[\ell_1, \ldots, \ell]$ (and returns its 
value entry, $b$, which equals $x[\ell]$). This is done in three steps. First the 
algorithm uses parent pointers to reach the special node, $v$, which is closest to the 
node corresponding to $C[t]$. Then the algorithm uses the special_parent pointers and 
special_ancestor pointers (i.e., TC-spanner edges) to reach the special node, $v'$, which 
is closest to the node corresponding to $C[r]$ (and is a descendant of it). This step 
uses the depth information that is provided in all special nodes in order to avoid 
"overshooting" $C[r]$. (Note that the depth of the node corresponding to $C[r]$ is known: 
it is $\ell - \ell_1 + 1$.) Since the special nodes $v$ and $v'$ are connected by an 
$O(\log n)$-TC-spanner, we know (by Claim 1) that there is a path of length $O(\log n)$ 
from $v$ to $v'$. While the algorithm does not know the depth of $v'$, it can use the 
depth of the node corresponding to $C[r]$ instead to decide which edges to take. In the 
last step, the node corresponding to $C[r]$ is reached by taking (basic) parent pointers 
from $v'$. 

Theorem 2 Algorithm 3 is $(1+\epsilon)$-competitive with LZ78 and for every input 
$x \in \{0,1\}^n$, the expected running time of Algorithm 4 is $O(\log n + 1/\epsilon^2)$. 
With high probability over the random coins of Algorithm 3 the running time of Algorithm 4 
is bounded by $O(\log n/\epsilon^2)$. 

Proof: For an input string $x \in \{0,1\}^n$, let $w(x)$ be the number of codewords (and 
hence words) in the LZ78 compression of $x$, and let $w'(x)$ be the number of words 
obtained when compressing with Algorithm 3 (so that $w'(x)$ is a random variable). Let 
$m'_1(x)$ be the number of simple codewords in the compressed string, let $m'_2(x)$ be the 
number of special codewords, and let $m'_3(x)$ be the number of position codewords. 
Therefore, $w'(x) = m'_1(x) + 5m'_2(x) + 2m'_3(x)$. By construction, 
$m'_1(x) + m'_2(x) = w(x)$, and so $w'(x) = w(x) + 4m'_2(x) + 2m'_3(x)$. Also by 
construction we have that $m'_3(x) = \epsilon w'(x)/40$, so that 
$w'(x) = \frac{w(x) + 4m'_2(x)}{1 - \epsilon/20}$. Since each phrase is selected to be 
encoded by a special codeword independently with probability $\epsilon/40$, by a 
multiplicative Chernoff bound, the probability that more than an $(\epsilon/20)$-fraction 
of the phrases will be selected, i.e., $m'_2(x) > (\epsilon/20)w(x)$, is bounded by 
$\exp(-\Omega(\epsilon w(x))) \le \exp(-\Omega(\epsilon\sqrt{n}))$ (since 
$w(x) \ge \sqrt{n}$). Therefore, with high probability (recall that we may assume that 
$\epsilon \ge c\log(n)/\sqrt{n}$ for a sufficiently large constant $c$) we get that 
$w'(x) \le \frac{1 + \epsilon/5}{1 - \epsilon/20} \cdot w(x) \le (1+\epsilon)w(x)$. Since 
the analysis of the running time is easier to follow by referring to specific steps in 
the pseudo-code of the algorithm (see Subsection A.2), we refer the reader to 
Subsection B.1 for the rest of the proof. ■ 
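
As a quick numeric sanity check of the final inequality, one can evaluate the ratio bound directly. Solving $w' = w + 4m'_2 + 2m'_3$ with $m'_3 = \epsilon w'/40$ gives $w'(1 - \epsilon/20) = w + 4m'_2$, and substituting $m'_2 \le (\epsilon/20)w$ yields a ratio of at most $(1+\epsilon/5)/(1-\epsilon/20)$. The helper name below is hypothetical and the sampled values of $\epsilon$ are arbitrary.

```python
def word_ratio_bound(eps):
    """Upper bound on w'(x) / w(x) when m2' <= (eps / 20) * w(x)."""
    return (1 + eps / 5) / (1 - eps / 20)

# The bound stays below 1 + eps for a few sample values of eps.
for eps in (0.01, 0.1, 0.5, 1.0):
    assert word_ratio_bound(eps) <= 1 + eps
```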
5 A Lower Bound for Random Access in LZ78 



In what follows we describe a family of strings, $x \in \{0,1\}^n$, for which random 
access to $x$ from the LZ78 compressed string, $C^x$, requires $\Omega(|C^x|)$ queries, 
where $|C^x|$ denotes the number of codewords in $C^x$. We construct the lower bound for 
strings, $x$, such that $|C^x| = \Omega(n/\log n)$ (Theorem 3) and afterwards extend 
(Theorem 4) the construction for general $n$ and $m$, where $n$ denotes the length of the 
uncompressed string and $m$ denotes the number of codewords in the corresponding 
compressed string. Note that $m$ is lower bounded by $\Omega(\sqrt{n})$ and upper bounded 
by $O(n/\log n)$. Consider the two extreme cases. In the first, the trie, $T^x$, has the 
topology of a line, for example when $x = 0\,01\,01^2 \cdots 01^j$; in this case 
$|C^x| = \Omega(\sqrt{n})$. In the second, the trie is a complete tree, corresponding for 
example to the string that is a concatenation of all the strings up to a certain length, 
ordered by their length. In the latter case, from the fact that $T^x$ is a complete binary 
tree on $m+1$ nodes it follows that $x$ is of length $\Theta(m\log m)$, thus 
$|C^x| = O(n/\log n)$. 

The idea behind the construction is as follows. Assume $m = 2^k - 1$ for some 
$k \in \mathbb{Z}^+$ and consider the string $S = 0\,1\,00\,01\,10\,11\,000 \cdots$, 
namely, the string that contains all strings of length at most $k-1$ ordered by their 
length and then by their lexicographical order. Let $S^\ell$ denote the string that is 
identical to $S$ except for the $\ell$-th string in order, $s$, amongst strings with 
prefix 01 and length $k-1$. We modify the prefix of $s$ from 01 to 00 and add an 
arbitrary bit to the end of $s$. The key observation is that the encodings of $S$ and 
$S^\ell$ differ in a single location, i.e. a single codeword. Moreover, this location is 
disjoint for different values of $\ell$ and therefore implies a lower bound of 
$\Omega(m)$ as formalized in the next theorem. 
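
The key observation can be checked on a small instance. The sketch below is illustrative only: the helper names (`lz78_parse`, `build_S`, `build_S_l`) are hypothetical, the appended "arbitrary bit" is fixed to 0, and codewords are represented as (parent index, bit) pairs with the root having index 0. It builds $S$ and $S^\ell$ for a small $k$, parses both with plain LZ78, and confirms that the two codeword lists differ in exactly one location.

```python
from itertools import product

def lz78_parse(x):
    """Plain LZ78 parse of a bit string into (parent_index, bit) codewords."""
    trie = {"": 0}                       # phrase -> index; the root is index 0
    codewords, phrase = [], ""
    for bit in x:
        if phrase + bit in trie:
            phrase += bit                # keep extending the current match
        else:
            codewords.append((trie[phrase], int(bit)))
            trie[phrase + bit] = len(trie)
            phrase = ""
    assert phrase == "", "sketch assumes x ends on a phrase boundary"
    return codewords

def all_strings(length):
    """All binary strings of the given length in lexicographical order."""
    return ["".join(bits) for bits in product("01", repeat=length)]

def build_S(k):
    """Phrases of S: all strings of length 1..k-1, by length then lexicographically."""
    return [s for i in range(1, k) for s in all_strings(i)]

def build_S_l(k, l):
    """Phrases of S^l: the l-th string with prefix 01 becomes s(k-1, l) + '0'."""
    phrases, last = build_S(k), all_strings(k - 1)
    q = len(last) // 4                   # number of strings with prefix 01
    offset = len(phrases) - len(last)    # phrases shorter than k-1
    phrases[offset + q + l - 1] = last[l - 1] + "0"
    return phrases

k = 4
base = lz78_parse("".join(build_S(k)))
for l in (1, 2):
    other = lz78_parse("".join(build_S_l(k, l)))
    diff = [t for t, (a, b) in enumerate(zip(base, other)) if a != b]
    print(l, diff)                       # a single differing codeword index per l
```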

Theorem 3 For every $m = 2^k - 2$ where $k \in \mathbb{Z}^+$, there exist 
$n = \Theta(m\log m)$, an index $0 \le i \le n$ and a distribution, $\mathcal{D}$, over 
$\{0,1\}^n \cup \{0,1\}^{n+1}$ such that 

1. $|C^x| = m$ for every $x \in \mathcal{D}$. 

2. Every algorithm $A$ for which it holds that 
$\Pr_{x \in \mathcal{D}}[A(C^x) = x_i] \ge 2/3$ must read $\Omega(2^k)$ codewords from 
$C^x$. 

Proof: Let $x \circ y$ denote $x$ concatenated to $y$ and let $\bigcirc_{i=1}^{t} s_i$ 
denote $s_1 \circ s_2 \circ \cdots \circ s_t$. Define 
$S = \bigcirc_{i=1}^{k-1}\left(\bigcirc_{j=1}^{2^i} s(i,j)\right)$, where $s(i,j)$ is the 
$j$-th string, according to the lexicographical order, amongst strings of length $i$ over 
alphabet $\{0,1\}$. For every $1 \le \ell \le q = 2^{k-1}/4$ define 
$S^\ell = \bigcirc_{i=1}^{k-1}\left(\bigcirc_{j=1}^{2^i} s^\ell(i,j)\right)$, where 
$s^\ell(i,j) = s(k-1,\ell) \circ 0$ for $i = k-1$ and $j = q+\ell$, and 
$s^\ell(i,j) = s(i,j)$ otherwise. Define $C^x_{i,j} \triangleq C^x[2^i - 1 + j]$. 
Therefore, $C^S_{i,j}$ corresponds to the $j$-th node in the $i$-th level of $T^S$, i.e. 
to $s(i,j)$ (see Figure 3). Thus $C^{S^\ell}_{i,j} \ne C^S_{i,j}$ for 
$(i,j) = (k-1, q+\ell)$ and $C^{S^\ell}_{i,j} = C^S_{i,j}$ otherwise. We define 
$\mathcal{D}$ to be the distribution of the random variable that takes the value $S$ with 
probability $1/2$ and the value $S^\ell$ with probability $1/(2q)$ for every 
$1 \le \ell \le q$. We first argue that for some absolute constant $\eta > 0$, for every 
algorithm, $A$, which for an input $C^x$ takes $\eta|C^x|$ queries from $C^x$, it holds 
that $\Pr_{R \in \mathcal{D}}[A(C^S) \ne A(C^R)] \le 1/6$. This follows from the 
combination of the fact that $q = \Omega(|C^x|)$ and the fact that $A$ must query the 
compressed string on the $\ell$-th location in order to distinguish $S^\ell$ from $S$. To 
complete the proof we show that there exists $0 \le i \le n$ such that 
$\Pr_{R \in \mathcal{D}}[S_i = R_i] = 1/2$, namely, we show that $S_i \ne S^\ell_i$ for 
every $1 \le \ell \le q$. Since the position of the phrases of length $k-1$ with prefix 1 
is shifted by one in $S^\ell$ with respect to $S$, we get that the above is true for 
$\Omega(|C^x|)$ bits. In particular, $S_i \ne S^\ell_i$ holds for every bit, $x_i$, that 
is encoded in the second to last position of a phrase of length $k-1$ with prefix 1 and 
suffix 01. ■ 
Theorem 3 can be extended as follows: 

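Theorem 4 For every $\tilde{m}$ and $\tilde{n}$ such that 
$\tilde{m}\log\tilde{m} < \tilde{n} < \tilde{m}^2$ there exist: 

1. $m = \Theta(\tilde{m})$ and $n = \Theta(\tilde{n})$ 

2. a distribution, $\mathcal{D}$, over $\{0,1\}^n \cup \{0,1\}^{n+1}$ 

3. an index $0 \le i \le n$ 

such that Conditions 1 and 2 in Theorem 3 hold. 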

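Proof: Set $k = \lceil \log\tilde{m} \rceil$, $t = \lceil \sqrt{\tilde{n}} \rceil$ and let 
$m = 2^k - 1 + t$. Define 
$S = \bigcirc_{i=0}^{k-1}\left(\bigcirc_{j=1}^{2^i} (0 \circ s(i,j))\right) \circ \bigcirc_{i=1}^{t} 1^i$ 
and 
$S^\ell = \bigcirc_{i=0}^{k-1}\left(\bigcirc_{j=1}^{2^i} (0 \circ s^\ell(i,j))\right) \circ \bigcirc_{i=1}^{t} 1^i$ 
(where $s(0,1)$ is the empty string). Therefore 
$n = \Theta(k2^k + t^2) = \Theta(\tilde{n})$. The rest of the proof follows the same 
lines as in the proof of Theorem 3. ■ 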

6 Experimental Results 

Our experiments show that on selected example files our scheme is competitive in practice 
(see Figure 1). Our results are given below in terms of the fraction of special codewords, 
$\alpha$, which is directly related to $\epsilon$ (see Theorem 2). We ran the scheme with 
$\alpha = 1/4, 1/8, 1/16$. The data points corresponding to $\alpha = 0$ plot the file 
size resulting from standard LZ78. 

With respect to the random access efficiency, we found that on average the time re- 
quired for random access is less than 1 millisecond while decompressing the entire file 
takes around 300 milliseconds. 

Acknowledgment. We would like to thank Giuseppe Ottaviano for a helpful discussion 
and two anonymous referees for their constructive comments. 
A Pseudo-code 

A.1 Deterministic Scheme Pseudo-code 



Algorithm 1: Deterministic Compression Algorithm 
Input: $x \in \{0,1\}^n$ 

Initialize T to a single root node; p := 1, j := 1. 
While (p ≤ n) 

1. Find a path in T from the root to the deepest node, v, which corresponds to a prefix 
of x[p, . . . , n − 1], where 0 corresponds to the left child and 1 corresponds to the 
right child. 

2. Create a new node u and set u.index := j, u.depth := v.depth + 1. 

3. If x[p + v.depth] = 0 set v.left := u and otherwise set v.right := u. 

4. Let a be the ancestor of u in T at depth 
$\max\{u.\mathrm{depth} - 2^{f(u.\mathrm{depth})} \cdot \lfloor \log n \rfloor, 0\}$. 

5. Output (p, v.index, a.index, x[p + u.depth]). 

6. p := p + u.depth + 1. 

7. j := j + 1. 


Algorithm 2: Random Access Algorithm for Deterministic Scheme 
Input: $C[1] = (p_1, i_1, k_1, b_1), \ldots, C[m] = (p_m, i_m, k_m, b_m)$, which 
represents a string compressed by Algorithm 1, and an index $1 \le \ell \le n$ 

1. Perform a binary search on $p_1, \ldots, p_m$ and find $p_t$ such that 
$p_t = \max_{1 \le i \le m}\{p_i \le \ell\}$. 

2. Find the codeword, $C[r] = (p_r, i_r, k_r, b_r)$, which corresponds to the ancestor of 
$C[t] = (p_t, i_t, k_t, b_t)$ at depth $\ell - p_t + 1$ in the trie. This is done as 
described in the proof of Claim 1 using the pointer information in the codewords/nodes 
(observe that the depth of $C[t]$ is $p_{t+1} - p_t$). 

3. Output $b_r$. 
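
The two steps of the random access procedure can be illustrated with a simplified sketch. It is not the scheme itself: the codewords here carry only (start position, parent index, bit), the extra ancestor pointer is omitted, and Step 2 walks plain parent pointers instead of following TC-spanner edges, so it does not achieve the stated running time. The function names `compress` and `access` are hypothetical, and positions are 1-based.

```python
from bisect import bisect_right

def compress(x):
    """LZ78 parse emitting (start_position, parent_index, bit); positions are 1-based."""
    trie = {"": 0}                        # phrase -> index; the root is index 0
    out, start, phrase = [], 1, ""
    for pos, bit in enumerate(x, start=1):
        if phrase + bit in trie:
            phrase += bit
        else:
            out.append((start, trie[phrase], int(bit)))
            trie[phrase + bit] = len(trie)
            start, phrase = pos + 1, ""
    if phrase:
        raise ValueError("sketch assumes x ends on a phrase boundary")
    return out

def access(codewords, n, l):
    """Return bit x[l] (1-based) using only the codewords."""
    starts = [p for p, _, _ in codewords]
    t = bisect_right(starts, l) - 1       # Step 1: phrase that contains position l
    end = codewords[t + 1][0] if t + 1 < len(codewords) else n + 1
    depth = end - codewords[t][0]         # phrase length equals depth in the trie
    target = l - codewords[t][0] + 1      # depth of the ancestor encoding x[l1..l]
    node = t + 1                          # codeword C[j] has trie index j
    for _ in range(depth - target):       # Step 2, done naively via parent pointers
        node = codewords[node - 1][1]
    return codewords[node - 1][2]         # Step 3: the value entry

x = "000101110010101101110000000"
cw = compress(x)
print([access(cw, len(x), l) for l in range(1, 9)])
```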
A.2 Randomized Scheme Pseudo-code 



Algorithm 3: Randomized Compression Algorithm 
Input: $x \in \{0,1\}^n$, $\epsilon$ 
Initialize T to a root node, p := 1, j := 1 
While (p ≤ n) 

1. Find a path in T from the root to a leaf, v, which corresponds to a prefix of 
x[p, . . . , n], where 0 corresponds to left child and 1 corresponds to right child. 

2. Create a new node u and set: 

• u.index := j 

• u.depth := v.depth + 1 

• u.special := 0 

3. If x[p + v.depth] = 0 set v.left := u and otherwise set v.right := u. 

4. h := j mod $40/\epsilon$. 

5. Toss a coin c, with success probability $\epsilon/40$. 

6. If c = 1 output a special codeword as follows: 

(a) u.special := 1 

(b) Let P denote the path in T from u to the root and let q be the first node in P 
such that q.special = 1 (if such exists, otherwise q = 0). 

(c) If q ≠ 0 set u.special_depth := q.special_depth + 1, otherwise 
u.special_depth := 0. 

(d) Let d := u.special_depth. If d ≠ 0, let a be the special node on P for which 
a.special_depth = $\max\{d - 2^{f(d)} \cdot \lfloor \log n \rfloor, 0\}$. 

(e) j := j + 4 

(f) Output ($\Delta$, u.depth, v.index, q.index, a.index, x[p + u.depth]) ($\Delta$ is a 
delimiter symbol) 

Else, output a simple codeword, namely (v.index, x[p + u.depth]). 

7. p := p + u.depth + 1. 

8. j := j + 1. 

9. If h > (j mod $40/\epsilon$), output ($\nabla$, p) ($\nabla$ is a delimiter symbol) 






Algorithm 4: Random Access Algorithm for Randomized Scheme 
Input: a string, S, which is the output of Algorithm 3, and an index $1 \le \ell \le n$. 
S consists of varying length codewords C[1], . . . , C[m] 

1. Perform a binary search on the position codewords in S to find a position codeword 
C[k] such that C[k].position ≤ $\ell$ and C[q].position > $\ell$, where C[q] is the next 
position codeword in S. 

2. p := C[k].position 

3. Starting from C[k + 1], scan S and find the codeword, C[t], which encodes the 
phrase that contains the bit at position $\ell$ as follows: 

(a) t := k + 1 

(b) d := Find-Depth(C[t]) 

(c) While (p + d ≤ $\ell$) 

i. p := p + d 

ii. t := t + 1; read the next codeword, C[t]. 

iii. d := Find-Depth(C[t]) 

4. C[r] := Find-Node-by-Depth(C[t], $\ell$ − p + 1) 

5. Output C[r].value 
Procedure Find-Node-by-Depth(u, d) 
Input: the source node, u, and the depth of the target node, d 

1. s := Find-Depth(u) − d 

2. While (u is not a special node and s > 0) 

(a) u := u.parent 

(b) s := s − 1 

3. v := u.special_parent 

4. While v.special_ancestor.depth < u.special_ancestor.depth 

(a) If (v.special_parent.depth < d) then break loop 

(b) Else, u := v 

5. While (u.special_parent.depth > d) 

(a) If (u.special_ancestor.depth > d) then u := u.special_ancestor 

(b) Else, u := u.special_parent 

6. s := u.depth − d 

7. While (s > 0) 

(a) u := u.parent 

(b) s := s − 1 

8. Output u 

Procedure Find-Depth(u) 
Input: source node u 

If u is a special node, return u.depth. 

i := 1 

While (u.parent is a simple node) 

1. u := u.parent 

2. i := i + 1 

Return i + u.parent.depth 
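
The walk performed by Procedure Find-Depth can be sketched as follows. This is a minimal illustration rather than the compressed-string layout: the trie is a hypothetical dictionary `nodes` mapping a node id to `(parent_id, depth_or_None)`, where a depth is stored only at special nodes, and the root is assumed to be special with depth 0 so that the walk always terminates.

```python
def find_depth(nodes, u):
    """Depth of node u, found by walking up to the closest special node."""
    steps = 0
    while nodes[u][1] is None:      # simple node: no depth stored, go to parent
        u = nodes[u][0]
        steps += 1
    return nodes[u][1] + steps      # stored depth plus the number of steps taken

# A path 0 <- 1 <- 2 <- 3 <- 4 <- 5 in which nodes 0 and 3 are special.
nodes = {0: (None, 0), 1: (0, None), 2: (1, None),
         3: (2, 3), 4: (3, None), 5: (4, None)}
print(find_depth(nodes, 5), find_depth(nodes, 2))
```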
B Running Time Analysis and Improvement 



B.1 Bounding the Running Time of Algorithm 4 

In Step 1, Algorithm 4 performs a binary search, therefore it terminates after at most 
$\log n$ iterations. In each iteration of the binary search the algorithm scans a constant 
number of words, as guaranteed by Step 9 in Algorithm 3. Hence, the running time of Step 1 
is bounded by $O(\log n)$. 

In order to analyze the remaining steps in the algorithm, consider any node $v$ in $T$. 
Since each node is picked to be special with probability $\epsilon/40$, the expected 
distance of any node to the closest special node is $O(1/\epsilon)$. Since the choice of 
special nodes is done independently, the probability that the closest special ancestor is 
at distance greater than $40c\log n/\epsilon$ is 
$(1 - \epsilon/40)^{40c\log n/\epsilon} < 1/n^c$. By taking a union bound over all $O(n)$ 
nodes, with high probability, for every node $v$ the closest special ancestor is at 
distance $O(\log n/\epsilon)$. 
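
The tail bound above follows from $(1-y)^{1/y} < 1/e$ for $0 < y < 1$, and can be checked numerically. In the sketch below the function name is hypothetical, and the logarithm is taken to be natural, which is an assumption (a base-2 logarithm only makes the exponent larger and the probability smaller).

```python
import math

def miss_probability(eps, c, n):
    """Probability that 40*c*ln(n)/eps consecutive ancestors are all non-special."""
    return (1 - eps / 40) ** (40 * c * math.log(n) / eps)

# The probability stays below n^(-c) for a few sample parameter choices.
for eps in (0.05, 0.5, 1.0):
    for n in (10**3, 10**6):
        assert miss_probability(eps, 2, n) < n ** -2
```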



The first implication of the above is that the running time of Procedure Find-Depth 
is $O(1/\epsilon)$ in expectation, and with high probability every call to Procedure 
Find-Depth takes time $O(\log n/\epsilon)$. Hence Step 3 in Algorithm 4 takes time 
$O(1/\epsilon^2)$ in expectation and $O(\log n/\epsilon^2)$ with high probability. It 
remains to upper bound the running time of Procedure Find-Node-by-Depth (see 
Subsection A.2), which is called in Step 4 of Algorithm 4. 

The running time of Steps 1, 2 and 7 in Procedure Find-Node-by-Depth is $O(1/\epsilon)$ 
in expectation, and $O(\log n/\epsilon)$ with high probability. The running time of 
Step 4 is $O(\log n)$ by the definition of the TC-spanner over the special nodes. 
Finally, by the explanation following the description of the algorithm regarding the 
relation between Step 5 in Procedure Find-Node-by-Depth and the path constructed in the 
proof of Claim 1, the running time of Step 5 is $O(\log n)$ as well. Summing up all 
contributions to the running time we get the bounds stated in Theorem 2. 



B.2 Improving the Running Time from $O(\log n/\epsilon^2)$ to $O(\log n/\epsilon + 1/\epsilon^2)$ 

As can be seen from the proof of Theorem 2, the dominant contribution to the running time 
of the random access algorithm (Algorithm 4) in the worst case (which holds with high 
probability) comes from Step 3 of the algorithm. We bounded the running time of this step 
by $O(\log n/\epsilon^2)$ while the running time of the other steps is bounded by 
$O(\log n/\epsilon)$. In this step the algorithm computes the length of $O(1/\epsilon)$ 
phrases by determining the depth in the trie of their corresponding nodes. This is done by 
walking up the trie until a special node is reached. Since we bounded (with high 
probability) the distance of every node to the closest special node by 
$O(\log n/\epsilon)$, we got $O(\log n/\epsilon^2)$. However, by modifying the algorithm 
and the analysis, we can decrease this bound to $O(\log n/\epsilon + 1/\epsilon^2)$. Since 
this modification makes the algorithm a bit more complicated, we only sketch it below. 

Let $v_1, \ldots, v_k$, where $k = O(1/\epsilon)$, be the nodes whose depth we are 
interested in finding. Let $T'$ be the tree that contains all these nodes and their 
ancestors in the trie. Recall that the structure of the trie is determined by the LZ78 
parsing rule, which is used by our compression algorithm, and that the randomization of 
the algorithm is only over the choices of the special nodes. To gain intuition, consider 
two extreme cases. In one case $T'$ consists of a long path, at the bottom of which is a 
complete binary tree, whose nodes are $v_1, \ldots, v_k$. In the other extreme, the least 
common ancestor of any two nodes $v_i$ and $v_j$ among $v_1, \ldots, v_k$ is very far away 
from both $v_i$ and $v_j$. Consider the second case first, and let $X_1, \ldots, X_k$ be 
random variables whose value is determined by the choice of the special nodes in $T'$, 
where $X_i$ is the distance from $v_i$ to its closest ancestor that is a special node. In 
this (second) case $X_1, \ldots, X_k$ are almost independent. Assuming they were truly 
independent, it is not hard to show that with high probability (i.e., 
$1 - 1/\mathrm{poly}(n)$), not only is each $X_i$ upper bounded by $O(\log n/\epsilon)$, 
but so is their sum. Such a bound on the sum of the $X_i$'s directly gives a bound on the 
running time of Step 3. 

In general, these random variables may be very dependent. In particular this is true in 
the first aforementioned case. However, in this (first) case, even if none of the nodes in 
the small complete tree are special, and the distance from the root of this tree to the 
closest special node is $\Theta(\log n/\epsilon)$, we can find the depth of all nodes 
$v_1, \ldots, v_k$ in time $O(\log n/\epsilon)$ (even though the sum of their distances to 
the closest special node is $\Theta(\log n/\epsilon^2)$). This is true because once we 
find the depth of one node by walking up to the closest special node, if we maintain the 
information regarding the nodes passed on the way, we do not have to take the same path 
up $T'$ more than once. Maintaining this information can be done using standard data 
structures at a cost of $O(\log(\log n/\epsilon))$ per operation. As for the analysis, 
suppose we redefine $X_i$ to be the number of steps taken up the trie until either a 
special node is reached, or another node whose depth was already computed is reached. We 
are interested in upper bounding $\sum_{i=1}^{k} X_i$. Since these random variables are 
not independent, we define a set of i.i.d. random variables, $Y_1, \ldots, Y_k$, where 
each $Y_i$ is the number of coins flipped until a 'HEADS' is obtained, where each coin has 
bias $\epsilon/c$. It can be verified that by showing that with high probability 
$\sum_{i=1}^{k} Y_i = O(\log n/\epsilon)$ we can get the same bound for 
$\sum_{i=1}^{k} X_i$, and such a bound can be obtained by applying a multiplicative 
Chernoff bound. 
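
The idea of never walking the same path twice can be sketched with a memo table. The sketch below is illustrative only: `parent` and `special_depth` are hypothetical dictionaries describing the trie, the root is assumed to be special with depth 0, and a Python dictionary stands in for the "standard data structures" mentioned above (so the per-operation cost differs from the bound stated in the text).

```python
def find_depths(parent, special_depth, targets):
    """Depths of all target nodes, walking each trie edge at most once.

    parent:        node -> parent node
    special_depth: node -> depth, given only for special nodes
    Returns (list of depths, total number of upward steps taken).
    """
    known = dict(special_depth)          # nodes whose depth is already available
    steps, depths = 0, []
    for v in targets:
        path, u = [], v
        while u not in known:            # stop at a special or already-visited node
            path.append(u)
            u = parent[u]
            steps += 1
        d = known[u]
        for w in reversed(path):         # record depths of all nodes passed
            d += 1
            known[w] = d
        depths.append(known[v])
    return depths, steps

# A path of 100 simple nodes below a special root; three targets near the bottom.
parent = {i: i - 1 for i in range(1, 101)}
depths, steps = find_depths(parent, {0: 0}, [100, 99, 98])
print(depths, steps)                     # the long path is walked only once
```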
Figure 2: The trie, $T^x$, implicitly defined by the LZ78 scheme on the 
string $x = 0\,00\,1\,01\,11\,001\,010\,110\,111\,000\,0000$. On input string $x$, the 
LZ78 scheme outputs a list of codewords, $C^x$ = 
{(0, 0), (1, 0), (0, 1), (1, 1), (3, 1), (2, 1), (4, 0), (5, 0), (5, 1), (2, 0), (10, 0)}. 
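
The codeword list in this caption can be decoded and re-parsed with a few lines of Python. The function names are hypothetical; codewords are (parent index, bit) pairs in which index 0 denotes the root and the $j$-th codeword creates the phrase with index $j$. The sketch decodes the list into its phrases and then confirms that a plain LZ78 parse of the resulting string gives back the same list.

```python
def lz78_decode(codewords):
    """Expand (parent_index, bit) codewords into the list of phrases they encode."""
    phrases = [""]                       # index 0 is the empty phrase (the root)
    for parent, bit in codewords:
        phrases.append(phrases[parent] + str(bit))
    return phrases[1:]

def lz78_parse(x):
    """Plain LZ78 parse of a bit string into (parent_index, bit) codewords."""
    trie = {"": 0}
    codewords, phrase = [], ""
    for bit in x:
        if phrase + bit in trie:
            phrase += bit
        else:
            codewords.append((trie[phrase], int(bit)))
            trie[phrase + bit] = len(trie)
            phrase = ""
    assert phrase == "", "sketch assumes x ends on a phrase boundary"
    return codewords

C = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 1), (2, 1),
     (4, 0), (5, 0), (5, 1), (2, 0), (10, 0)]
phrases = lz78_decode(C)
x = "".join(phrases)
print(" ".join(phrases))
assert lz78_parse(x) == C
```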




Figure 3: $T(S)$ and $T(S^\ell)$ 
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